Rows: 4,009
Columns: 12
$ brand <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", …
$ model <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35…
$ model_year <int> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202…
$ milage <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "…
$ fuel_type <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol…
$ engine <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "…
$ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed…
$ ext_col <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi…
$ int_col <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl…
$ accident <chr> "At least 1 accident or damage reported", "At least 1 acc…
$ clean_title <chr> "Yes", "Yes", "", "Yes", "", "", "Yes", "Yes", "Yes", "Ye…
$ price <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$…
---
title: "Used Car Price Determinants"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: journal
    source_code: embed
---
```{r setup, include=FALSE}
pacman::p_load(flexdashboard,tidyverse,dplyr,ggplot2,forcats)
```
Introduction
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
**Abstract:**
This project analyzes the main factors that influence used car prices using a cleaned and filtered data set of used car listings. After converting key variables to numeric form and removing extreme outliers, exploratory graphs revealed clear trends, most notably that higher mileage lowers price and a newer model year raises it. A multiple linear regression model was then built to measure these effects. The results showed that **mileage, model year, brand, model, and accident history all significantly impacted the price**. This analysis provides a foundation to help first-time buyers better understand fair pricing in the used car market.
**Motivation:**
Many college students have either never bought a car or have never bought one without their parents' help. The used car market is a massive sea of confusing, inconsistent, and sometimes untruthful listings. As a result, first-time buyers risk overpaying or choosing a vehicle that may not be reliable.
By building a data-driven regression model to predict used car prices, I aim to give young and first-time buyers in America a leg up in negotiation and an understanding of what a fair price should look like.
### Dataset
```{r}
cars <- read.csv("./data/used_cars.csv")
glimpse(cars)
```
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Research Questions
- Which predictors most strongly influence used car prices?
- Does including accident history and title status improve model performance?
- Do exterior and interior colors affect used car prices, or are these mostly just visual differences?
### Data Description {data-height=350}
**Source:** Kaggle (Used Car Prediction Dataset)
**Variables**
- Price (listed sale price)
- Milage (mileage of the vehicle)
- Model_year (year manufactured)
- Brand (manufacturer)
- Model (car model)
- Accident (accident history)
- Clean_title (clean or salvage title)
- Exterior color
- Interior color
Data Cleaning
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Data Cleaning
**Price and Mileage:** The milage (mileage) and price columns contained non-numeric characters such as "mi.", "$", and "," which prevented any numerical analysis. I used R's **gsub() function** to remove these characters and keep only the digits, then converted the result to numeric. This made it possible to plot price against mileage properly.
**Extreme Values:** The data also contained some extreme values, including cars priced over 500,000 USD and 3 observations between 1 and 3 million USD. Using **filter()**, I kept only listings priced under 200,000 USD, with fewer than 300,000 miles, and with a model year of 1990 or later. Removing these rare outliers allowed for a more meaningful analysis.
**Brand and Model Names:** The data set contained many brands of cars and 1898 unique model names, which made a clean-looking graph impossible. Brands with fewer than 50 listings were lumped using **fct_lump_min()**, and all but the 15 most frequent models were lumped using **fct_lump_n()**, grouping the less common ones into an overall "Other" category. This reduced clutter on the x-axis and allowed the model to focus on the most common brands and models.
**Converting Categorical Variables to Factors:** Columns such as brand, model, accident, and clean_title were converted to factors so that they are properly interpreted in the regression analysis. This ensured that R treated these variables as categorical, with a separate indicator for each level in the multiple linear regression.
```{r, echo=TRUE}
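# Strip everything except digits ("51,000 mi." -> 51000, "$10,300" -> 10300)
cars <- cars %>%
  mutate(
    milage = as.numeric(gsub("[^0-9]", "", milage)),
    price = as.numeric(gsub("[^0-9]", "", price))
  )

# Drop extreme listings, lump rare categories, and add the log response
cars_filtered <- cars %>%
  filter(price < 200000, milage < 300000, model_year >= 1990) %>%
  mutate(
    brand = fct_lump_min(brand, min = 50),  # brands with < 50 listings -> "Other"
    model = fct_lump_n(model, n = 15),      # keep the 15 most common models
    clean_title = factor(clean_title),
    accident = factor(accident),
    log_price = log(price)
  )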
```
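As a quick check of the character-stripping step, the sketch below applies the same `gsub()` pattern to a few raw values taken from the data preview. The helper name `strip_to_number` is hypothetical and used only for illustration. It also shows one caveat: the pattern removes decimal points as well, so it is only safe if prices and mileages are recorded as whole numbers, which is an assumption here.

```r
# Sample strings copied from the raw data preview.
raw_milage <- c("51,000 mi.", "34,742 mi.", "22,372 mi.")
raw_price  <- c("$10,300", "$38,005", "$54,598")

# Hypothetical helper: drop every non-digit character, then convert to numeric.
strip_to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

clean_milage <- strip_to_number(raw_milage)
clean_price  <- strip_to_number(raw_price)

# Caveat: the pattern also removes decimal points, so a value with cents
# such as "$10,300.50" becomes 1030050 rather than 10300.5.
with_cents <- strip_to_number("$10,300.50")
```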
The price was heavily right skewed. Using **log(price)** as the response reduces this skew and brings the residuals closer to normality, which better matches the assumptions of linear regression.
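To illustrate why a log transform helps with right-skewed prices, here is a minimal sketch on simulated data. The log-normal "prices" and the `skewness` helper are assumptions made for illustration. They are not the project's data or results.

```r
set.seed(369)  # arbitrary seed so the simulation is reproducible

# Simulated right-skewed "prices" (log-normal); NOT the project's data.
sim_price <- rlnorm(5000, meanlog = 10, sdlog = 0.8)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sim_price)       # clearly positive: long right tail
skew_log <- skewness(log(sim_price))  # close to zero after the transform

c(raw = skew_raw, log = skew_log)
```

The same `skewness()` helper could be applied to `price` and `log_price` in the filtered data to quantify what the two histograms show visually.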
Row
-----------------------------------------------------------------------
### Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = price)) +
  geom_histogram(fill = "skyblue", bins = 50) +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Distribution of Used Car Prices", x = "Price", y = "Count")
```
### Log Price Distribution {data-height=400}
```{r}
ggplot(cars_filtered, aes(x = log_price)) +
  geom_histogram(fill = "lightgreen", bins = 50) +
  labs(title = "Distribution of Log-Transformed Price", x = "Log(Price)", y = "Count")
```
Exploratory Data Analysis
===
Column {.tabset data-width=450}
------
### Price vs Mileage
```{r}
ggplot(cars_filtered, aes(x = milage, y = price)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Mileage",
    y = "Price (USD)",
    title = "Used Car Price vs Mileage (Filtered)"
  )
```
### Price vs Model Year
```{r}
ggplot(cars_filtered, aes(x = model_year, y = price)) +
  geom_point(alpha = 0.5, color = "blue") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Model Year", y = "Price (USD)", title = "Used Car Price vs Model Year")
```
### Price vs Brand
```{r}
ggplot(cars_filtered, aes(x = brand, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Brand",
    y = "Price ($)",
    title = "Used Car Price by Brand"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### Price vs Model
```{r}
ggplot(cars_filtered, aes(x = model, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Model",
    y = "Price ($)",
    title = "Used Car Price by Model"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### Price vs Accident History
```{r}
ggplot(cars_filtered, aes(x = accident, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Accident History",
    y = "Price ($)",
    title = "Used Car Price by Accident History"
  )
```
### Price vs Title Status
```{r}
ggplot(cars_filtered, aes(x = clean_title, y = price)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = "Clean Title",
    y = "Price ($)",
    title = "Used Car Price by Clean Title Status"
  )
```
Row
-----------------------------------------------------
### Exploratory Data Analysis {data-height=400}
**Price vs Mileage:** The scatter plot showed a strong negative trend: cars with higher mileage tend to have lower prices, and the curved shape of the trend suggests a nonlinear relationship. In the log-price regression, the mileage coefficient is −7.62 × 10⁻⁶, meaning that **for every additional 10,000 miles, the expected price decreases by about 7.3%** (exp(−0.0762) ≈ 0.927), holding the other predictors constant. Mileage proved to be one of the strongest predictors of used car price.
**Price vs Model Year:** The scatter plot showed that newer vehicles tended to have higher prices than older vehicles. The regression coefficient for model year, 0.0467, means that **for each additional year that the car is newer, the expected price increases by about 4.78%** (exp(0.0467) ≈ 1.0478). This agrees with the EDA finding that newer cars are more expensive.
**Price vs Title Status:** The box plot comparing cars with clean titles vs non-clean titles showed that clean title cars tended to have a slightly higher price, but there was a lot of overlap between the groups. This was confirmed in the regression. Clean title (p = 0.64) was not a significant predictor, meaning that after controlling for predictors such as mileage and model year, title status does not explain much additional variation in price.
**Price vs Model:** The regression shows that a car's model has a substantial impact on its price. High-end **sports cars and luxury vehicles** such as the Porsche 911 and BMW M Series are associated with significantly higher prices, especially compared with base versions of the same vehicles. In contrast, more common vehicles such as the Mustang GT and Jeep Wrangler did not show a statistically significant difference in price from the baseline model.
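Because the response is log(price), a coefficient translates into a percent change in price through exp(). The sketch below works through that arithmetic for the two coefficients reported above. The helper name `pct_change` is hypothetical. It also shows that the shortcut of reading 100 × coefficient × change directly as a percentage is only an approximation, which is why 10,000 miles corresponds to roughly 7.3% rather than 7.6%.

```r
# Coefficients as reported in the text above (log-price regression).
b_milage     <- -7.62e-6  # change in log(price) per additional mile
b_model_year <- 0.0467    # change in log(price) per model year

# Hypothetical helper: exact percent change in price when a predictor
# changes by `delta`, for a model with log(price) as the response.
pct_change <- function(b, delta = 1) 100 * (exp(b * delta) - 1)

pct_10k_miles <- pct_change(b_milage, delta = 10000)  # about -7.3
pct_one_year  <- pct_change(b_model_year)             # about  4.8

# Linear shortcut 100 * b * delta, accurate only when b * delta is small.
approx_10k_miles <- 100 * b_milage * 10000            # -7.62
```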
### Methods {data-height=400}
```{r, include=FALSE}
model <- lm(log(price) ~ milage + model_year + model + clean_title, data = cars_filtered)
summary(model)
```
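One possible way to check whether adding accident history and title status improves a model like this one is a partial F-test between nested models, together with adjusted R-squared. This is a sketch of that technique on simulated data, not the analysis actually run for this project. The data frame `sim`, its effect sizes, and the object names are all assumptions made for illustration.

```r
set.seed(369)  # arbitrary seed for reproducibility
n <- 500

# Simulated listings; NOT the project's data. All names here are hypothetical.
sim <- data.frame(
  milage      = runif(n, 5000, 200000),
  model_year  = sample(2005:2023, n, replace = TRUE),
  accident    = factor(sample(c("None", "Reported"), n, replace = TRUE,
                              prob = c(0.7, 0.3))),
  clean_title = factor(sample(c("No", "Yes"), n, replace = TRUE,
                              prob = c(0.2, 0.8)))
)

# Assumed data-generating process: accidents lower log price, title has no effect.
sim$log_price <- 10 - 7e-6 * sim$milage + 0.05 * (sim$model_year - 2015) -
  0.3 * (sim$accident == "Reported") + rnorm(n, sd = 0.2)

fit_reduced <- lm(log_price ~ milage + model_year, data = sim)
fit_full    <- lm(log_price ~ milage + model_year + accident + clean_title,
                  data = sim)

# Partial F-test: do the two added predictors jointly improve the fit?
f_test  <- anova(fit_reduced, fit_full)
p_value <- f_test[["Pr(>F)"]][2]

# Adjusted R-squared penalizes extra terms, so it can be compared across models.
adj_r2 <- c(reduced = summary(fit_reduced)$adj.r.squared,
            full    = summary(fit_full)$adj.r.squared)
```

On the real data, the same comparison would use `cars_filtered` with and without `accident` and `clean_title` in the formula.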
Diagnostics
===
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Diagnostics
**Residuals vs Fitted**
The Residuals vs Fitted plot evaluates whether the relationship between the predictors and log price is linear and whether the residuals have constant variance. The plot reveals a strong curved pattern with a dip in the middle, which indicates that the model is missing some nonlinear structure in the data.
**Q-Q Residuals**
The Q-Q plot assesses whether the residuals follow a normal distribution. In this case, the points deviate sharply from the diagonal line in both tails, indicating extreme outliers in the data. This suggests that more cleaning or a different transformation is needed.
**Scale-Location**
The Scale-Location plot checks whether the residuals exhibit constant variance across levels of the fitted values. The red line slopes upward, which indicates that as the predicted price of a used car increases, the error in the model's prediction also increases.
**Cook's Distance**
The Cook's Distance plot identifies observations that exert an unusually large influence on the regression estimates. Several observations in the model (833, 2256, and 2708) stand out. These influential observations can shift the regression coefficients and should be investigated further.
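Cook's distance can also be inspected numerically rather than only from the plot. The sketch below uses simulated data with one planted outlier, not the project's data. The 4/n cutoff is a common rule of thumb and is an assumption here, not a formal test.

```r
set.seed(369)  # arbitrary seed for reproducibility

# Simulated data with one planted outlier; NOT the project's data.
demo <- data.frame(x = 1:50)
demo$y <- 2 * demo$x + rnorm(50, sd = 3)
demo$y[50] <- demo$y[50] + 60  # push the last point far off the line

demo_fit <- lm(y ~ x, data = demo)
cooks_d  <- cooks.distance(demo_fit)

# Rule of thumb (an assumption, not a strict test): flag D > 4 / n.
threshold <- 4 / length(cooks_d)
flagged   <- which(cooks_d > threshold)

# Compare coefficients with and without the most influential point.
worst      <- unname(which.max(cooks_d))
coef_all   <- coef(demo_fit)
coef_minus <- coef(lm(y ~ x, data = demo[-worst, ]))
```

Refitting without the flagged rows, as in the last lines, is one way to see how much the coefficients depend on those observations.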
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Residuals vs Fitted
```{r}
plot(model, which = 1)
```
### Q-Q Residuals
```{r}
plot(model, which = 2)
```
### Scale-Location
```{r}
plot(model, which = 3)
```
### Cook's Distance
```{r}
plot(model, which = 4)
```
Conclusion
===
Row
-------
### **Conclusion**
This project explored the primary factors that drive used car prices using a cleaned and filtered data set of online listings. Across the exploratory analysis and regression modeling, the two main predictors that stood out were mileage and model year. Vehicles with higher mileage tended to sell for significantly less, while newer vehicles commanded higher prices. Brand and car model also played a substantial role, with luxury and performance vehicles at the top of the price range. Accident history showed small differences when holding other factors constant.
The multiple linear regression model largely supported these findings and revealed that a combination of model year, mileage, and car model can explain a substantial amount of the variability in used car prices. While the model captured broad pricing patterns and provides reasonable insight for first-time car buyers, the diagnostics indicated several signs of nonlinearity and influential outliers. As a result, the model should be interpreted as a baseline rather than a full-fledged predictive tool.
### **Limitations**
While this analysis provided useful insight into predicting used car prices, several limitations should be acknowledged. First, the **diagnostic plots showed signs of nonlinearity**, suggesting that the regression model did not capture the full relationship in the data, especially for mileage at the higher price levels. Additionally, **the data contained several very influential outliers**, which may represent unusually priced vehicles and can disproportionately affect the regression coefficients. The decision to **lump brands and models** into broader categories helped simplify the visualizations, but it also removed detail about less common vehicles, for example specific trims. The data also lacks factors such as **location, trim level, and optional packages (luxury or performance)** that may play a major role in real-world pricing. Finally, because the data comes from **online listings rather than final sale prices**, it may overstate what buyers actually pay: negotiation is very common when buying a car and often lowers the price at least a little.
Column {.tabset data-width=350}
-----------------------------------------------------------------------
### Personal Information
Jack Sarsen
B.S. Statistics, University of Dayton
Final Project for MTH 369 (Regression and Linear Models)
linkedin.com/jacksarsen